Supplementary Materials for On the Effects of Data Scale on Computer Control Agents

Neural Information Processing Systems

For completeness, in the following we include a datasheet based on the format of [1].

For what purpose was the dataset created? Was there a specific task in mind?

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
The dataset contains episodes of human demonstrations for mobile device control.

How many instances are there in total (of each type, if appropriate)?



[R1/R2] Infinite width assumption: the infinite width assumption is needed due to the technical detail that the norm


We thank the reviewers for their valuable comments. We respond to the main concerns below.

Similar to Zhang et al. [31], we chose the 10k-block ResNet to stress the … We will rephrase L243 to express this better.

The derivative of the weights depends on this term due to the chain rule. We will make this explicit in the revised manuscript.





First, a random patch of the image is selected and resized to 224 × 224 with a random horizontal flip, followed by a color distortion consisting of a random sequence of brightness, contrast, saturation, and hue adjustments, and an optional grayscale conversion. Finally, Gaussian blur and solarization are applied to the patches.

Optimization. We use the LARS optimizer [70] with a cosine decay learning rate schedule [71], without restarts, over 1000 epochs, with a warm-up period of 10 epochs. We set the base learning rate to 0.2, scaled linearly [72] with the batch size (LearningRate = 0.2 × BatchSize/256). For the target network, the exponential moving average parameter τ starts from τ_base = 0.996 and is increased to one during training.
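The linear learning-rate scaling and the target-network momentum ramp above can be sketched as follows. This is a minimal illustration, not the authors' code; the cosine shape of the τ ramp is an assumption (the text only states that τ increases from τ_base = 0.996 to one during training), and the function names are ours.

```python
import math

def scaled_lr(base_lr=0.2, batch_size=256):
    # Linear scaling rule from the text: LearningRate = 0.2 * BatchSize / 256.
    return base_lr * batch_size / 256

def target_ema_tau(step, total_steps, tau_base=0.996):
    # EMA momentum for the target network. The text says tau starts at
    # tau_base and increases to one over training; a cosine ramp (common in
    # BYOL-style setups) is assumed here for the interpolation shape.
    progress = (math.cos(math.pi * step / total_steps) + 1.0) / 2.0
    return 1.0 - (1.0 - tau_base) * progress
```

For example, a batch size of 512 yields a learning rate of 0.4, and τ moves from 0.996 at step 0 toward 1.0 at the final step.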